GEM-PRO - List of Gene IDs¶

This notebook gives an example of how to run the GEM-PRO pipeline with a list of gene IDs.

Input: List of gene IDs

Output: GEM-PRO model

Imports¶

In [1]:

import sys
import logging

In [2]:

# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO

In [3]:

# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Logging¶

Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. Debug is most verbose.

CRITICAL
- Only really important messages shown
ERROR
- Major errors
WARNING
- Warnings that don’t affect running of the pipeline
INFO (default)
- Info such as the number of structures mapped per gene
DEBUG
- Really detailed information that will print out a lot of stuff

Warning: DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!

In [4]:

# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #

In [5]:

# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]

Initialization of the project¶

Set these three things:

ROOT_DIR
- The directory where a folder named after your PROJECT will be created
PROJECT
- Your project name
LIST_OF_GENES
- Your list of gene IDs

A directory will be created in ROOT_DIR with your PROJECT name. The folders are organized like so:

ROOT_DIR
└── PROJECT
    ├── data  # General storage for pipeline outputs
    ├── model  # SBML and GEM-PRO models are stored here
    ├── genes  # Per gene information
    │   ├── <gene_id1>  # Specific gene directory
    │   │   └── protein
    │   │       ├── sequences  # Protein sequence files, alignments, etc.
    │   │       └── structures  # Protein structure files, calculations, etc.
    │   └── <gene_id2>
    │       └── protein
    │           ├── sequences
    │           └── structures
    ├── reactions  # Per reaction information
    │   └── <reaction_id1>  # Specific reaction directory
    │       └── complex
    │           └── structures  # Protein complex files
    └── metabolites  # Per metabolite information
        └── <metabolite_id1>  # Specific metabolite directory
            └── chemical
                └── structures  # Metabolite 2D and 3D structure files

Note: Methods for protein complexes and metabolites are still in development.

In [6]:

# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROJECT = 'genes_GP'
LIST_OF_GENES = ['b0761', 'b0889', 'b0995', 'b1013', 'b1014', 'b1040', 'b1130', 'b1187', 'b1221', 'b1299']
PDB_FILE_TYPE = 'mmtf'

In [7]:

# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_list=LIST_OF_GENES, pdb_file_type=PDB_FILE_TYPE)

[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Creating GEM-PRO project directory in folder /tmp
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: /tmp/genes_GP: GEM-PRO project location
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 10: number of genes

Mapping gene ID –> sequence¶

First, we need to map these IDs to their protein sequences. There are 2 ID mapping services provided to do this - through KEGG or UniProt. The end goal is to map a UniProt ID to each ID, since there is a comprehensive mapping (and some useful APIs) between UniProt and the PDB.

Note: You only need to map gene IDs using one service. However you can run both if some genes don’t map in one service and do map in another!

Methods¶

GEMPRO.kegg_mapping_and_metadata(kegg_organism_code, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]

Map all genes in the model to KEGG IDs using the KEGG service.

Steps:

Download all metadata and sequence files in the sequences directory
Creates a KEGGProp object in the protein.sequences attribute
Returns a Pandas DataFrame of mapping results

Parameters:

kegg_organism_code (str) – The three letter KEGG code of your organism
custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model gene IDs.
outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
set_as_representative (bool) – If mapped KEGG IDs should be set as representative sequences
force_rerun (bool) – If you want to overwrite any existing mappings and files

In [8]:

# KEGG mapping of gene ids
my_gempro.kegg_mapping_and_metadata(kegg_organism_code='eco')
print('Missing KEGG mapping: ', my_gempro.missing_kegg_mapping)
my_gempro.df_kegg_metadata.head()

[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 10/10: number of genes mapped to KEGG
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> KEGG. See the "df_kegg_metadata" attribute for a summary dataframe.

Missing KEGG mapping:  []

Out[8]:

	kegg	refseq	uniprot	pdbs	sequence_file	metadata_file
gene
b0761	eco:b0761	NP_415282	P0A9G8	1B9M;1H9S;1B9N;1O7L;1H9R	eco-b0761.faa	eco-b0761.kegg
b0889	eco:b0889	NP_415409	P0ACJ0	2GQQ;2L4A	eco-b0889.faa	eco-b0889.kegg
b0995	eco:b0995	NP_415515	P38684	1ZGZ	eco-b0995.faa	eco-b0995.kegg
b1013	eco:b1013	NP_415533	P0ACU2	4JYK;4XK4;4X1E;3LOC	eco-b1013.faa	eco-b1013.kegg
b1014	eco:b1014	NP_415534	P09546	3E2Q;4JNZ;3E2R;4JNY;2GPE;4O8A;3E2S;2FZN;1TJ1;1...	eco-b1014.faa	eco-b1014.kegg

GEMPRO.uniprot_mapping_and_metadata(model_gene_source, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]

Map all genes in the model to UniProt IDs using the UniProt mapping service. Also download all metadata and sequences.

Parameters:

model_gene_source (str) –
the database source of your model gene IDs. See: http://www.uniprot.org/help/api_idmapping Common model gene sources are:
- Ensembl Genomes - ENSEMBLGENOME_ID (i.e. E. coli b-numbers)
- Entrez Gene (GeneID) - P_ENTREZGENEID
- RefSeq Protein - P_REFSEQ_AC
custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model genes.
outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
force_rerun (bool) – If you want to overwrite any existing mappings and files

In [9]:

# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='ENSEMBLGENOME_ID')
print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
my_gempro.df_uniprot_metadata.head()

[2018-02-05 18:12] [root] INFO: getUserAgent: Begin
[2018-02-05 18:12] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 18:12] [root] INFO: getUserAgent: End
[2018-02-05 18:12] [root] WARNING: status is not ok with Bad Request
[2018-02-05 18:12] [root] WARNING: Results seems empty...returning empty dictionary.

[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 0/10: number of genes mapped to UniProt
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.

Missing UniProt mapping:  ['b1130', 'b0889', 'b1221', 'b0761', 'b0995', 'b1187', 'b1013', 'b1299', 'b1014', 'b1040']

[2018-02-05 18:12] [ssbio.pipeline.gempro] WARNING: Empty dataframe

Out[9]:

	uniprot	reviewed	gene_name	kegg	refseq	num_pdbs	pdbs	ec_number	pfam	seq_len	description	entry_date	entry_version	seq_date	seq_version	sequence_file	metadata_file
gene

GEMPRO.set_representative_sequence(force_rerun=False)[source]

Automatically consolidate loaded sequences (manual, UniProt, or KEGG) and set a single representative sequence.

Manually set representative sequences override all existing mappings. UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.

Parameters:	force_rerun (bool) – Set to True to recheck stored sequences

In [10]:

# Set representative sequences
my_gempro.set_representative_sequence()
print('Missing a representative sequence: ', my_gempro.missing_representative_sequence)
my_gempro.df_representative_sequences.head()

[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 10/10: number of genes with a representative sequence
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: See the "df_representative_sequences" attribute for a summary dataframe.

Missing a representative sequence:  []

Out[10]:

	uniprot	kegg	pdbs	sequence_file	metadata_file
gene
b0761	P0A9G8	eco:b0761	1B9M;1H9S;1B9N;1O7L;1H9R	eco-b0761.faa	eco-b0761.kegg
b0889	P0ACJ0	eco:b0889	2GQQ;2L4A	eco-b0889.faa	eco-b0889.kegg
b0995	P38684	eco:b0995	1ZGZ	eco-b0995.faa	eco-b0995.kegg
b1013	P0ACU2	eco:b1013	4JYK;4XK4;4X1E;3LOC	eco-b1013.faa	eco-b1013.kegg
b1014	P09546	eco:b1014	3E2Q;4JNZ;3E2R;4JNY;2GPE;4O8A;3E2S;2FZN;1TJ1;1...	eco-b1014.faa	eco-b1014.kegg

Mapping representative sequence –> structure¶

These are the ways to map sequence to structure:

Use the UniProt ID and their automatic mappings to the PDB
BLAST the sequence to the PDB
Make homology models or
Map to existing homology models

You can only utilize option #1 to map to PDBs if there is a mapped UniProt ID set in the representative sequence. If not, you’ll have to BLAST your sequence to the PDB or make a homology model. You can also run both for maximum coverage.

Methods¶

GEMPRO.map_uniprot_to_pdb(seq_ident_cutoff=0.0, outdir=None, force_rerun=False)[source]

Map all representative sequences’ UniProt ID to PDB IDs using the PDBe “Best Structures” API. Will save a JSON file of the results to each protein’s sequences folder.

The “Best structures” API is available at https://www.ebi.ac.uk/pdbe/api/doc/sifts.html The list of PDB structures mapping to a UniProt accession sorted by coverage of the protein and, if the same, resolution.

Parameters:	seq_ident_cutoff (float) – Sequence identity cutoff in decimal form outdir (str) – Output directory to cache JSON results of search force_rerun (bool) – Force re-downloading of JSON results if they already exist
Returns:	A rank-ordered list of PDBProp objects that map to the UniProt ID
Return type:	list

In [11]:

# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()

[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Mapping UniProt IDs --> PDB IDs...
[2018-02-05 18:12] [root] INFO: getUserAgent: Begin
[2018-02-05 18:12] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 18:12] [root] INFO: getUserAgent: End
[2018-02-05 18:12] [root] WARNING: status is not ok with Bad Request
[2018-02-05 18:12] [root] WARNING: Results seems empty...returning empty dictionary.

[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 0/10: number of genes with at least one experimental structure
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Completed UniProt --> best PDB mapping. See the "df_pdb_ranking" attribute for a summary dataframe.
[2018-02-05 18:12] [ssbio.pipeline.gempro] WARNING: Empty dataframe

Out[11]:

GEMPRO.blast_seqs_to_pdb(seq_ident_cutoff=0, evalue=0.0001, all_genes=False, display_link=False, outdir=None, force_rerun=False)[source]

BLAST each representative protein sequence to the PDB. Saves raw BLAST results (XML files).

Parameters:

seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
all_genes (bool) – If all genes should be BLASTed, or only those without any structures currently mapped
display_link (bool, optional) – Set to True if links to the HTML results should be displayed
outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False

In [12]:

# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)

[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 5: number of genes with additional structures added from BLAST

Out[12]:

	pdb_id	pdb_chain_id	hit_score	hit_evalue	hit_percent_similar	hit_percent_ident	hit_num_ident	hit_num_similar
gene
b0761	1b9n	B	1091.0	5.530720e-119	0.931298	0.931298	244	244
b0761	1o7l	D	1089.0	1.096280e-118	0.931298	0.931298	244	244

Downloading and ranking structures¶

Methods¶

GEMPRO.pdb_downloader_and_metadata(outdir=None, pdb_file_type=None, force_rerun=False)[source]

Download ALL mapped experimental structures to each protein’s structures directory.

Parameters:	outdir (str) – Path to output directory, if GEM-PRO directories were not set or other output directory is desired pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired force_rerun (bool) – If files should be re-downloaded if they already exist

Warning: Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.

In [13]:

# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)

[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute for a summary dataframe.
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Saved 15 structures total

Out[13]:

	chemicals	description	experimental_method	mapped_chains	pdb_id	pdb_title	resolution	structure_file	taxonomy_name
gene
b0761	NI	ModE (MOLYBDATE-DEPENDENT TRANSCRIPTIONAL REGU...	X-RAY DIFFRACTION	A;B	1b9m	REGULATOR FROM ESCHERICHIA COLI	1.75	1b9m.mmtf	Escherichia coli
b0761	NI	MODE (MOLYBDATE DEPENDENT TRANSCRIPTIONAL REGU...	X-RAY DIFFRACTION	A;B	1b9n	REGULATOR FROM ESCHERICHIA COLI	2.09	1b9n.mmtf	Escherichia coli

GEMPRO.set_representative_structure(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine=’needle’, always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, skip_large_structures=False, clean=True, force_rerun=False)[source]

Set all representative structure for proteins from a structure in the structures attribute.

Each gene can have a combination of the following, which will be analyzed to set a representative structure.

Homology model(s)

Ranked PDBs

BLASTed PDBs

If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.

Parameters:

seq_outdir (str) – Path to output directory of sequence alignment files, must be set if GEM-PRO directories were not created initially
struct_outdir (str) – Path to output directory of structure files, must be set if GEM-PRO directories were not created initially
pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
always_use_homology (bool) – If homology models should always be set as the representative structure
rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
allow_mutants (bool) – If mutations should be allowed or checked for
allow_deletions (bool) – If deletions should be allowed or checked for
allow_insertions (bool) – If insertions should be allowed or checked for
allow_unresolved (bool) – If unresolved residues should be allowed or checked for
clean (bool) – If structures should be cleaned
force_rerun (bool) – If sequence to structure alignment should be rerun

In [14]:

# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()

[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 5/10: number of genes with a representative structure
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.

Out[14]:

	id	is_experimental	file_type	structure_file
gene
b0761	REP-1b9n	True	pdb	1b9n-A_clean.pdb
b0889	REP-2gqq	True	pdb	2gqq-A_clean.pdb
b1013	REP-4xk4	True	pdb	4xk4-A_clean.pdb
b1187	REP-1h9t	True	pdb	1h9t-A_clean.pdb
b1221	REP-1rnl	True	pdb	1rnl-A_clean.pdb

In [15]:

# Looking at the information saved within a gene
my_gempro.genes.get_by_id('b1187').protein.representative_structure
my_gempro.genes.get_by_id('b1187').protein.representative_structure.get_dict()

Out[15]:

<StructProp REP-1h9t at 0x7f3880a275c0>

Out[15]:

{'_structure_dir': '/tmp/genes_GP/genes/b1187/b1187_protein/structures',
 'chains': [<ChainProp A at 0x7f38830e1c18>],
 'date': None,
 'description': 'FATTY ACID METABOLISM REGULATOR PROTEIN',
 'file_type': 'pdb',
 'id': 'REP-1h9t',
 'is_experimental': True,
 'mapped_chains': ['A'],
 'notes': {},
 'original_structure_id': '1h9t',
 'resolution': 3.25,
 'structure_file': '1h9t-A_clean.pdb',
 'taxonomy_name': 'ESCHERICHIA COLI'}

Saving your GEM-PRO¶

Warning: Saving is still experimental. For a full GEM-PRO with sequences & structures, depending on the number of genes, saving can take >5 minutes.

GEMPRO.save_json(outfile, compression=False): Save the object as a JSON file using json_tricks

In [16]:

import os.path as op
my_gempro.save_json(op.join(my_gempro.model_dir, '{}.json'.format(my_gempro.id)), compression=False)

[2018-02-05 18:12] [root] WARNING: json-tricks: numpy scalar serialization is experimental and may work differently in future versions
[2018-02-05 18:12] [ssbio.io] INFO: Saved <class 'ssbio.pipeline.gempro.GEMPRO'> (id: genes_GP) to /tmp/genes_GP/model/genes_GP.json